Optimal Data Partitioning Shape for Matrix Multiplication on Three Fully Connected Heterogeneous Processors
Authors
Abstract
Parallel matrix-matrix multiplication (MMM) is used in scientific codes across many disciplines. While the optimal division of MMM among homogeneous compute nodes has been widely studied, the optimal solution for heterogeneous systems remains an open problem. Dividing MMM across multiple processors or clusters requires consideration of the performance characteristics of both the computation and the communication subsystems. The degree to which each of these affects execution time depends on the system and on the algorithm used to divide, communicate, and compute the MMM data. Our previous work determined that, for all ratios of processing power, communication bandwidth, and matrix size, the optimum shape must be one of six well-defined shapes for each of the five MMM algorithms studied. This paper further reduces the number of potentially optimal candidate shapes to three defined shapes known as Square Corner, Square Rectangle, and Block Rectangle. We then find the optimum shape for each algorithm and for all ratios of computational power among processors, ratios of overall computational power to communication speed, and problem sizes. The Block Rectangle, a traditional 2D rectangular partition shape, is predictably optimal when using relatively homogeneous processors, and is also optimal for heterogeneous systems with a fast, a medium, and a slow processor. However, the Square Corner shape is optimal for heterogeneous environments with one powerful processor and two slower processors, and the Square Rectangle is optimal for heterogeneous environments composed of two fast processors and a single less powerful processor. These theoretical results are confirmed by a series of experiments conducted on Grid’5000, which show both that the predicted optimum shape is indeed optimal and that the remaining two partition shapes perform in their predicted order.
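As a rough illustration of what a partition shape means here, the Python sketch below builds one plausible Square Corner layout on an N x N matrix. It is not taken from the paper: the function name `square_corner`, the labels P, R, and S, the placement of the two squares in opposite corners, and the rule that each processor's share of elements is proportional to its relative speed are all assumptions made for illustration only.

```python
import math
from collections import Counter


def square_corner(n, speeds):
    """Return an n x n grid of owner labels for a hypothetical Square Corner layout.

    speeds = (p, r, s) are relative speeds, with P assumed to be the fastest.
    Assumption: each slower processor (R, S) owns a square whose area is
    roughly proportional to its speed; P owns everything else.
    """
    p, r, s = speeds
    total = p + r + s
    # Side lengths of the two squares, rounded to whole matrix elements.
    side_r = round(math.sqrt(n * n * r / total))
    side_s = round(math.sqrt(n * n * s / total))
    if side_r + side_s > n:
        # The two corner squares would share rows and columns.
        raise ValueError("speed ratio does not fit this sketch of the shape")

    grid = [["P"] * n for _ in range(n)]
    for i in range(side_r):                # R: top-left corner
        for j in range(side_r):
            grid[i][j] = "R"
    for i in range(n - side_s, n):         # S: bottom-right corner
        for j in range(n - side_s, n):
            grid[i][j] = "S"
    return grid


def element_counts(grid):
    """Count how many matrix elements each processor owns."""
    return Counter(label for row in grid for label in row)


if __name__ == "__main__":
    layout = square_corner(10, (8, 1, 1))
    for row in layout:
        print("".join(row))
    print(dict(element_counts(layout)))
```

With one fast processor and two slow ones, as in the 8:1:1 example, the two small squares sit in opposite corners and the fast processor holds the remainder. Because the side lengths are rounded, the element counts only approximate the speed ratio.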
Similar resources
Two-Dimensional Matrix Partitioning for Parallel Computing on Heterogeneous Processors Based on Their Functional Performance Models
The functional performance model (FPM) of heterogeneous processors has proven to be more realistic than the traditional models because it integrates many important features of heterogeneous processors such as the processor heterogeneity, the heterogeneity of memory structure, and the effects of paging. Optimal 1D matrix partitioning algorithms employing FPMs of heterogeneous processors are alre...
Matrix Multiplication on Three Heterogeneous Processors
We present a new algorithm specifically designed to perform matrix multiplication on three heterogeneous processors. This algorithm is an extension of the ‘square-corner’ algorithm designed for two-processor architectures [2]. For three processors, this algorithm partitions data in a way which on a fully-connected network minimizes the total volume of communication (TVC) between the processors ...
Distributed Data Partitioning for Heterogeneous Processors Based on Partial Estimation of Their Functional Performance Models
The paper presents a new data partitioning algorithm for parallel computing on heterogeneous processors. Like traditional functional partitioning algorithms, the algorithm assumes that the speed of the processors is characterized by speed functions rather than speed constants. Unlike the traditional algorithms, it does not assume the speed functions to be given. Instead, it uses a computational...
Column-Based Matrix Partitioning for Parallel Matrix Multiplication on Heterogeneous Processors Based on Functional Performance Models
In this paper we present a new data partitioning algorithm to improve the performance of parallel matrix multiplication of dense square matrices on heterogeneous clusters. Existing algorithms either use single speed performance models which are too simplistic or they do not attempt to minimise the total volume of communication. The functional performance model (FPM) is more realistic than singl...
Data Allocation Strategies for Dense Linear Algebra Kernels on Heterogeneous Two-dimensional Grids
We study the implementation of dense linear algebra computations, such as matrix multiplication and linear system solvers, on two-dimensional (2D) grids of heterogeneous processors. For these operations, 2D-grids are the key to scalability and efficiency. The uniform block-cyclic data distribution scheme commonly used for homogeneous collections of processors limits the performance of these opera...